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Abstract —In this paper, we propose a generalized scale mixture 
family of distributions, namely the Power Exponential Scale 
Mixture (PESM) family, to model the sparsity inducing priors 
currently in use for sparse signal recovery (SSR). We show that 
the successful and popular methods such as LASSO, Reweighted 
i\ and Re weigh ted £2 methods can be formulated in an unified 
manner in a maximum a posteriori (MAP) or Type I Bayesian 
framework using an appropriate member of the PESM family 
as the sparsity inducing prior. In addition, exploiting the natural 
hierarchical framework induced by the PESM family, we utilize 
these priors in a Type II framework and develop the corresponding 
EM based estimation algorithms. Some insight into the differences 
between Type I and Type II methods is provided and of particular 
interest in the algorithmic development is the Type II variant of the 
popular and successful reweighted £\ method. Extensive empirical 
results are provided and they show that the Type II methods exhibit 
better support recovery than the corresponding Type I methods. 

Index Terms —Sparse Bayesian Learning, LASSO, Reweighted 
£ 1 , Re weighted £ 2 , Gaussian Scale Mixture 

I. Introduction 

Sparse signal recovery (SSR), i.e, finding sparse signal 
representations from overcomplete dictionaries, has become 
a very active research area in recent times because of its 
wide range engineering applications and interesting theoretical 
nature [Q, HJ, [3], @. 

A. Problem Formulation 

The SSR problem involves solving an under-determined 
system of equations y = $x, where vector y is the N x 1 
measurement vector and <f> is the N x M overcomplete dic¬ 
tionary, where M > TV,and it is assumed that Spark(<l>) = N. 
€> are often formed from a physically meaningful model and 
the vector x is the Mxl vector of interest. As the system has 
fewer equations than unknowns, it can have infinitely many 
solutions and thus additional information is needed to identify 
which of these candidate solutions is indeed the appropriate 
one for the problem at hand. In the SSR problem, it will be 
assumed that the solution of interest is sparse, i.e. most of the 
entries are zero. Ideally one can recover the optimal sparsest 
solution xo by solving the following £0 optimization problem 
0 

min||x||o such that y = $x, (1) 

X 

where ||x||o is a measure of the support of x. In practice, mea¬ 
surements are generally corrupted by noise, which motivates 
the following modified optimization problem: 

min||y-$x[|! + A||x|| 0 (2) 

X 
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where, A > 0 is related to the measurement noise variance. 

However, the above optimization problem is not convex 
and is known to be NP-hard. For computational tractability, 
the original penalty factor ||xj|o is often approximated by a 
suitable surrogate p(x) leading to the optimization problem 

min||y - $x||l + Ag(x) (3) 

X 

Different choices of the penalty factor <?(x), also referred to 
here as diversity measure, lead to different SSR algorithms 
0, 0, 0. 0, and it has been shown that the choice of 
a strictly concave penalty factor on the positive orthant leads 
to a objective function with local minima being sparse and 
sparsest solution as a global minimum under some conditions 
0 . Majorization-Minimization (TOj can be employed to solve 
this optimization problem for such penalty functions, and this 
has led to the development of useful and effective reweighted 
norm minimization algorithms. Typically, l\ and £2 norms are 
selected because of their convex nature and the later because 
of the closed form solution at each iteration. 

B. Related Literature 

Minimizing diversity measures p(x) to recover the sparse 
representations has been a popular algorithm exploration av¬ 
enue. In this framework, the SSR problem formulation can 
also be viewed as a regularization approach to signal re¬ 
construction. A popular approach among this class is the l v 
norm minimization based methods, p = 1 leads to a tractable 
and computationally attractive convex optimization problem 
and the very well known approaches such as Basis Pursuit, 
LASSO are based on the £\ framework 111 II . 1121. Other than 
the convexity property, i\ based approaches have been sup¬ 
ported by theoretical guarantees of exact recovery given some 
conditions on the overcomplete dictionary 0, which makes 
these approaches attractive options. The recently proposed 
reweighted £\ and £2 norm minimization approaches (6), (5j, 
DU have empirically shown superior recovery performance 
over l\ minimization and are considered in this work. 

In addition to the regularization framework, another options 
for SSR algorithm development is the Bayesian framework 

da, na. ge ge ge ge |20| . In a Bayesian framework, 

the sparsity constraint is incorporated by choosing a suitable 
sparse prior on the coefficient vector x. In a Bayesian setting, 
there are two popular avenues for algorithm development: a 
Type I MAP based approach, and a Type II Evidence Maxi¬ 
mization approach involving a Hierarchical model. Most of the 
approaches discussed above, based on |3j, can be interpreted 
and cast in a suitable Type I framework. A Type II framework 
has been considered in CLQ, GE where a Relevance Vector 
Machine is adapted to the problem at hand. In on, in, 
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m a Type II optimization problem has been transformed into 
a Type I problem by employing a suitable penalty function 
and reweighted norm minimization algorithm is developed 
to solve the resulting optimization problem. Following the 
Type II framework, a Laplacian prior which corresponds to 
£\ norm minimization can also be represented in a Hierarchy 
using a Gaussian Scale Mixture (GSM) representation OH, 
EH. In the statistics community, the well known Bayesian 
Lasso 1261 also makes use of the equivalence of a hierarchical 
Gaussian-Exponential prior to the Laplace prior, and conducts 
a fully Bayesian inference (via Markov chain Monte Carlo 
or MCMC sampling algorithms). Demi-Bayesian Lasso ED 
solves the same problem using a Type II approach. It has been 
shown empirically that a Type II methods performs consistently 
better than Type I, i.e the MAP estimation approach, and 
theoretical analysis in support for this superiority has recently 
begun to appear. However, much remains to be done and 
this work is an attempt in this direction. In 128) . the two 
different frameworks are analyzed in a generalized Hierarchical 
Bayesian setting which motivates us to analyze these two 
frameworks for the specific SSR problem to gather additional 
insights by exploiting domain knowledge. In (291 , Type I and 
Type II frameworks for SSR were introduced using two forms 
of density representation, a convex representation and a GSM 
representation, to provide an unified treatment. We build on this 
work and employ a generalized scale mixture representation 
to establish connections and develop enhancements to popular 
SSR algorithms, as well as treat both l\ and £2 variants in an 
unified manner. 

As mentioned above, a key ingredient behind the Type II 
methods is the Scale Mixture/Hierarchical representation of 
the super gaussian priors, which allows one to design efficient 
algorithms conveniently. Gaussian Scale Mixtures (GSM) 1301 . 
ED, 129] and Laplacian Scale mixtures (LSM) E3 have been 
studied before in the context of sparsity. In this work we 
discuss a more general Scale Mixture framework, the Power 
Exponential Scale Mixture (PESM) family, for SSR algorithm 
development. The PESM representation includes the popular 
GSM and LSM as special cases and provides a mechanism to 
provide a unified view of the popular £ 1 and £2 frameworks 
currently employed. This work will emphasize the generalized 
t (GT) distribution family of priors, a member of PESM, since 
it has a wide range of tail shapes, and also includes the heavy 
tailed super gaussian distributions. GT family of distributions 
have been mentioned in statistics literatures for design of robust 
regressors for several financial modeling tasks, where the heavy 
tail nature of GT helps to model the outliers (33), (34). 

C. Contributions of the Paper 

• We discuss a generalized scale mixture framework, the 
power exponential scale mixture (PESM) family, and 
show how many of the super gaussian densities used in 
practice can be represented using this framework. 

• We summarize two types of Bayesian frameworks, i.e. 
Type I and Type II for SSR in detail, along with providing 
connections to traditional norm minimization approaches 
by suitable choice of sparse prior distributions. Of partic¬ 
ular importance is the treatment of the diversity measure 


used in connection with the reweighted £\ algorithm as 
well as an unified treatment of both £\ and £2 based 
approaches. 

• We formulate and unify three well known diversity min¬ 
imization based SSR algorithms in the PESM framework 
and derive the Type I and Type II versions of them. 
Of particular interest is the Type II counterpart of the 
reweighted £\ algorithm @. 

• We analyze the difference between Type I and Type II 
inference procedures and our analysis shows the funda¬ 
mental difference between these two frameworks and also 
helps to understand a potential reason for the empirical 
superiority of Type II methods over Type I. 

• Extensive empirical experimentation results are presented 
to support the superiority claim of Type II methods over 
their Type I counterpart. 

D. Article Organization 

The rest of the paper is organized as follows. In Section [Il-A[ 
a generalized scale mixture representation, the Power Expo¬ 
nential Scale Mixtures (PESM) family, is presented which are 
of main importance to design Bayesian methods for SSR. In 
Section III we discuss Type I/MAP algorithms for SSR and 
derive a unified inference procedure and provide connection 
with three well known SSR algorithms. In Section IV we 
discuss Type II framework for SSR along with analyzing the 
fundamental difference between Type I and Type II algorithms. 
The EM based inference procedure for Type II algorithms 
to estimate the coefficient vector and the hyperparameters is 
also developed which includes the counterpart to the popular 
reweighted £\ method. We present experimental results of the 
proposed algorithms in Section V in different settings and 
finally conclusions are presented in Section VI. 

II. Scale Mixture Distributions 

Scale mixture distributions namely Gaussian Scale mixtures 
(GSM) and Laplacian Scale mixtures (LSM) have gained lot 
of attention in recent years because of their ability to represent 
complex heavy tailed super gaussian distributions in a simple 
hierarchical manner (30), ED, ED, ED- In the statistics 
community, robustness has been the major reason for the use 
of scale mixtures. In regression analysis, the method of least 
squares often fails because of the outliers in the data. The 
need to model the outliers motivates the use of heavy tailed 
distribution. 

A. Power Exponential Scale Mixture (PESM) distribution 

In this work, a more general Power Exponential Scale 
Mixture (PESM) distribution, which is a generalization of GSM 
and LSM, is presented. The PESM representation is then used 
to model the prior sparse distribution over the vector x and for 
sparse signal recovery algorithm development. 

Power exponential (PE) distributions were first introduced 
by Box and Tiao (1962) in the context of robust regression 
to deal with non-normality. PE distribution is symmetric about 
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Fig. 1: PESM: Generalized scale mixtures 


the origin and a zero mean PE distribution has the following 
parameterized form: 


PE(x;p, a) 


p e 


(_M )P 


2*r(§) 


(4) 


It is evident from the above given form, that p = 2 
results in the normal distribution, whereas p = 1 connects to 
the well known Double exponential or Laplacian distribution. 
p < 2 leads to distribution with heavier tails than the Gaussian 
distribution. 

PESM family of distributions refer to distributions that can 
be represented as follows: 



p(at|7)p(7)d7 


PE(x-,p,-f)p{-y)d'y (5) 


Choice of distributional parameter p along with different suit¬ 
able mixing densities, i.e p( 7 ), will lead to different distri¬ 
butions including the super gaussian distributions. Because of 
the scale mixture representation, the generation of the random 
variable X can be viewed in a hierarchy, i.e. generate 7 using 
p( 7 ) followed by generating X using p(x I 7 ). The framework 
allows for dealing with complicated models in a simple manner 
and is indispensable as we move towards complex problems 
with structure. 

As special cases, the choice of p = 2 leads to Gaussian 
Scale Mixtures (GSM) which has been very popular in the 
literature, and p = 1 leads to the Laplacian Scale Mixtures 
(LSM). Interestingly, a Laplacian distribution p(x) = | e ~ a ^ 
can be represented as a GSM with exponential mixing density 
p( 7 ), i.e. p( 7 ) = exp(— ^- 7 )«( 7 ), where u(.) is the unit 
step function OH. More explicitly, 

POO 

p(x) = / p(*|7)p(7)d7 

Jo 

roo -1 2 2 2 

= f 0 T2^ exp( -^ )x T exp( -y 7)d7 (6) 

& —alail 
= 2 6 

This means, any LSM can also be represented as a GSM with 
an extra layer of hierarchy. This will play an important role 
in the SSR algorithm development. This fact also leads to 
the observation of the relationship between the different scale 
mixture families as depicted in Figure [7] 


B. An example of PESM: Generalized t Distribution 

In this example, we will consider an inverse generalized 
gamma (GG) distribution as our mixing density p( 7 ) in the 
hierarchical representation 0 for the PESM family. It leads to 
a generalized t distribution which is a superset of all the sparse 
distributions that have been used in practice in several recent 
works, e.g. Generalized double Pareto (GDP), Laplacian and 
Student-t distributions, among others. 

The Generalized t Distribution has the form: 


GT(x;a,p,q ) 


V 


(1 + 


qaP ) 


(V) 


Where 77 is the normalization constant, p and q are the shape 
parameters and a is the scale parameter. Interestingly, p and 
q provide the flexibility to represent different tail behavior 
using this distribution. Larger values of p and q correspond 
to thin tailed distributions whereas smaller values of p and q 
are associated with heavy tailed distributions. 

As mentioned above, the GT distribution family can be rep¬ 
resented in PESM framework using £>( 7 ) = GG( 7 ; —p, a, q) 
where, 


GG(x; —p, a, q) = rj {a / x) pq+1 e ( 8 ) 

Interesting special case of note is p = 2, which leads to a 
student t distribution, a prior that has been used in the popular 
Sparse Bayesian Learning (SBL)/Relevance Vector Machine 
(RVM) work and can be decomposed as a Gaussian Scale 
mixture with inverse Gamma as the mixing density. Employing 
p = 1 leads to a Generalized Double Pareto distribution (GDP) 
discussed in 1351 which can be represented as a scale mixture 
of Laplacian following equation [5] 

In Table [T] we summarize some special cases that have 
been used for SSR that arise by different choices of the shape 
parameters of GT, i.e. p and q. 

Among Scale Mixtures, GSM in particular has gained a lot 
of interest over the years in the literature and the proposed 
PESM framework is an interesting generalization for SSR 
puiposes. As shown in f29l . GSM can only be used to represent 
supergaussian densities, i.e. distributions with positive kurtosis 
whereas PESM representation can also be used for subgaussian 
densities along with supergaussian densities. One example 
is the previously discussed generalized t distribution, which 
becomes a thin tailed subgaussian distribution for p > 2 
(q = 1, a = 1). Moreover, for the purposes of the SSR work, 
the general PESM allows one to treat both the LSM and GSM 
in a unified manner thereby enabling treatment of £\ and 
based algorithms in a unified manner. 

III. B-SSR: Type I 

Type I inference corresponds to standard MAP estimation 
technique in B-SSR. In this section we review the Type I 
framework and derive a Type I algorithm using PESM as 
the sparse prior. Then we specialize the result using the 
Generalized t distribution as the sparse prior and also show 
that the generalized algorithm reduces to well known SSR 
algorithms. 
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A. Background on MAP Estimation (Type I methods) 

Having chosen a sparsity enforcing distribution p(x), 

thereby allowing one to narrow the space of candidate solutions 
in a manner consistent with application-specific assumptions, a 
maximum a posteriori (MAP) estimator of x is then obtained 
as (Type I estimation) 

x = argmaxp(xjy) = arg maxp(y|x)p(x) 

x x (9) 

= argmax [logp(yjx) + logp(x)] 

X 

Using the Gaussian noise assumption, and a separable prior 
distribution p(x) = Y\iP{ x i), the MAP estimate is obtained 
by minimizing 

J(x) = |!$x-y||^ + A^« 7 (x ! ), (10) 

i 

where g(x) is determined by logp(*). Incorporating sparsity 
by enforcing a sparse (supergaussian) distribution as the prior, 
p(x), reduces to choosing g(.). It has been shown that g(.) 
which is symmetric, concave and nondecreasing functions 
on [0, oo) are useful choices in this context 1361 . Now, as 
discussed above, many of these sparse priors can be represented 
in a hierarchy and belong to the PESM family. 

In order to contrast with the Type II formulation to follow, 
with the PESM representation one can revisit the equation 0 
and note that Type I involves integrating out the hyperparam¬ 
eter 7 . 

x = argmaxp(x|y) 

X 

[ (ID 

= argmaxp(y|x) / p(x\~f)p(~/)d~f 

B. Unified Type I Inference Procedure 

In this section we derive the EM inference procedure for 
the PESM family in the Type I framework, i.e, we find the 
MAP estimate of x where a PESM has been employed for the 
sparsity inducing prior p(x). Because of the separable prior, 
the p(xi) have an independent scale mixture representation, 

ro o 

p(xi) = / p0i|7i)p(7i) d 7; (12) 

Jo 

For MAP estimation of x, we treat the 7 ;’s as hidden 
variables and employ an EM algorithm. The complete data 
log-likelihood can be written as, 


logp(y, x, 7 ) = logp(y|x) + logp(x| 7 ) + logp( 7 ) (13) 


To formulate the Q function, we need to find the conditional 
expectation of the complete data log-likelihood with respect to 
posterior of the hidden variables p( 7 |x, y) which reduces to 
p( 7 |x) by virtue of the Markovian property induced by the 
hierarchy, i.e. 7 —>■ x —t y. Since in the M step we need 
to maximize the Q function with respect to x, we are only 
concerned with the first two terms in ( | 1 3| ) and only the second 
term has dependencies on 7 ^. This is the only term we need 
to be concerned with during the E-step. Now from the scale 
mixture decomposition and considering the ith component of 
x, 


logp(a:;| 7 i) 


\ogPE( Xi ;p, 7 i) = - + constants (14) 

7i 


Hence, for determining the Q function we need the following 
conditional expectation, E li \ Xi [4p], 

To compute the concerned expectation we will use the fol¬ 
lowing trick. Differentiating inside the integral of the marginal 

P(Xi), 


^ n 00 

p'(xi) * / p(*i|7i)p(7i) d 7i 


= -px\xi\ p 1 sign(x») / A ? p(x i ,y i )dy i 
0 7 i 


-1 f° 

= -p X \xi\ p i sign(x i )p(a:i) / -^pfy^xfjdyi 

Jo 7i 


Hence, 


= -px \xi\ p 1 sign(xi)p(x i )E lilx . [^] 

7i 


? rj_] = _ P'(xi) _ 

j 7*N»L 7 PJ px \xi\ v ~ 1 &ign(xi)p(xi) 


(15) 


(16) 


and enables determining the Q function. Then the M step 
reduces to, 


-(fc+i) • 1 [| 

x' -argmm-— y 

X 2o z 


-^x \\ 2 + J 24 k) \xi\ P (IV) 


2 . , . ( k) 

Where cr is the variance of the measurement noise and w\ = 

E ~u\^ k) 

Following the traditional path of EM, the algorithm is an 
iterative one, i.e, in the E step the weights are computed and 
in the M step a weighted norm minimization is solved. This 
alternate procedure is carried out iteratively till convergence. 


C. Special cases of Type I using Generalized t distribution 
In this section we specialize the derived unified Type I EM 
algorithm with the generalized t distribution as p(xi). We can 
write p(xf) ~ exp(—f(xi)) where. 


f( Xi ) = (q + l/p)log(l+^L) 


Thus, 


E. 


, rJ_i = __ 

-r.K L^pJ p x |a; i |p- 1 sign(a;i) 


Substituting the value of f'(xi) we get, 


E 1 . |x-[4d = q + V P i 
L 7f J d aP + \Xi\ P 


(18) 


(19) 


( 20 ) 


So the M step will become, 


Wfc+i) 

x ’ — arg mm 


2^|| y -$x|| 2 + ^u;f ) | ;Ci | p (21) 


2 , . r/c') 

Where a is the variance of the measurement noise and w\ = 

tp r 1 1 _ 9+1/p 

fJ “ 

In following subsections we will show how with specific 
choices of the distribution parameters of the generalized t, we 
can derive well known Type I (MAP estimation) based SSR 
algorithms. 






5 


q 

P 

Prior Distribution 

Penalty Function 

SSR Algorithm 

q 00 

2 

Normal 

||® I 2 

Ridge Regression 

q —»■ 00 

1 

Laplacian 

||®||i 

LASSO 

q > 0 (degrees of freedom) 

2 

Student t distribution 

log(e + a; 2 ) 

Reweighted £2 (Chartrand’s) 

q > 0 (shape parameter) 

1 

Generalized Double Pareto 

log(e+|x|) 

Reweighted fi(Candes’s) 


TABLE I: Variants of GT distribution and their connection to Type I Algorithms 


1) LASSO (h-minimization) sns-- Interestingly we see from 
Table I that for specific values of the shape parameters (q —> 
oo and p = 1 ), a generalized t distribution can be used to 
represent a double exponential or Laplacian distribution. Now 
to relate with the unified Type I MAP estimation inference 
procedure, taking the limit as q —> oo and a = 1 in (| 20 j, 
we get Wi = 1. Hence in the M step we are just solving a l\ 
penalized regression once as the weights are not changing over 
iterations, which is essentially the LASSO algorithm. 


2) Reweighted li -minimization (Candes et al CT): The pop¬ 
ular reweighted £\ -minimization (Candes et al f5J) is a special 
case of the MAP estimation approach using a generalized t 
distribution as sparse prior. 

Selecting the parameters of the generalized t as follows; q = 
t,p = 1 , a = 1 , one obtains, 

p(xi\t) = GT(1, 1, e) = + ^ (e+1) (22) 

which when substituted in equation (JTO]), results in the follow¬ 
ing cost function, 

min||y - $a:||l + A V'logdxil + e) (23) 

X { ■* 


In a. the above mentioned cost function is optimized using 
a MM approach. Now substituting the distribution parameters 
in equation 1 20 \ the weights reduce to Wi = | . These are 

the same weights obtained in |'5] via a MM method and p = 1 
in Equation l |2 1 1 > results in a weighted £\ minimization problem 
with the weights being a function of the previous estimate. This 
special case of GT has been also called the Generalized Double 
Pareto (GDP) distribution in the literature HD. 

Following the scale mixture decomposition of the GT distri¬ 
bution, as shown in Equation |5](, since p = 1 we can represent 
the prior as a Laplacian Scale Mixture. 


f f 1 

p(x) = j p(x| 7 )p( 7 )d 7 = J —e i pirfdy, (24) 

where p( 7 ) = GG( 7 ;—1, l,e). This observation is summa¬ 
rized in the following lemma. 

Lemma 3.1: Let x ~ Laplacian( 0 , 7 ), 7 ~ 
GG( 7 ; — 1 , 1 , e)), then the resulting marginal density for 
x is GT(l,l,e). 


3) Reweighted £ 2 -minimization (®, HU): Another popular 
SSR algorithm, the reweighted £2 minimization can also be 
represented in a Bayesian Type I setting by employing a Stu¬ 
dent t distribution with degree of freedom 2 1 . This heavytailed 
sparse prior p{x) is again a special case of the generalized t 
distribution as shown in the table. 


p( Xi \e) = GT(V2, 2 , e) = ^ 2) (25) 

l 1 "T 26 ) 

The nature of the tail of the student t distribution is controlled 
by degrees of freedom parameter e and smaller values of e 
correspond to heavier tails. The associated diversity penalty 
factor is given by giycfi) = \og(xl 4- e). For a Type I inference 
procedure, we can utilize the unified approach discussed above 
in Section |III-C| and substitute the shape and scale parameters 
p = 2 , q _= e, a = \p2 of the generalized t distribution in 
Equation |20l to obtain, Wi = 21 + 777 ^- Since p = 2, Equation 
© leadsro the reweighted £2 minimization algorithm as 
discussed in 0 

IV. B-SSR: Type II (Evidence Maximization) 

The success of Type II approaches like SBL for SSR 
problems motivate the Type II approach for the general PESM 
family. As special cases, the three Type I algorithms discussed 
in Section |III-C| are explored in the Type II setting. We also 
analyze the difference between a Type I algorithm and its 
Type II counterpart which provides insights into the reasons 
for superior recovery performance of Type II methods. 

In a Type II procedure, instead of integrating out the hypepa- 
rameters 7 , we estimate them using an evidence maximization 
method, i.e. 

7 = argmaxp( 7 |y) = arg maxp( 7 )p(y| 7 ) 

“7 7 

r (26) 

= argmaxp( 7 ) / p(y|x)p(x| 7 )dx 

The evidence framework integrates over the coefficient vector 
x to obtain the evidence p(y| 7 ). This evidence is weighted 
by the hyperprior p{~i) and maximized over 7 . Once 7 is 
obtained, the relevant posterior p(xjy) is approximated, often 
as p(x|y; 7 ), and the mean of the approximated posterior is 
used as a point estimate. Sparsity is achieved by many of the 
7 i being zero 11211 . {22), j 23 8 . 

A. Unified Type II EM algorithm 

To solve the above mentioned optimization problem, we 
again employ the EM algorithm this time by treating x as 
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the hidden variable. As in Section |III-B| we assume a sparse 
prior p(x) from the PESM family has been utilized and that 
the measurement noise is Gaussian with variance <r 2 . 

Hence the Q function has the form, 

Qh) = -Ex|y; 7 ,o-2 [logp(y|x) + logp(x| 7 ) + logp( 7 )] 

^-+logp(7i)] 

V 7 i 

(27) 

Since in the M step we are only concerned with the terms 
involving 7 , examining them reveals that the E-step requires 
the computation of the following conditional expectation 

E^ t ^[\xA P ] =<\ Xi \ p > (28) 

In the M step we will maximize the Q function with respect 
to 7 ; to find the update rules. To illustrate, if we consider a 
non informative hyperprior, i.e, p( 7 i) = 1 , 

Qh) = ~ “ lQ g 7 i - < ^ > (29) 

i P V 

Taking the derivative of the Q function w.r.t 7 i and setting it 
to zero results in, 

7i = p < |xj| p > (30) 

Since the E step requires the computation of the conditional 
expectation given by Equation {28}, we can either look for a 
closed form solution or revert to the MCMC technique [26i|. 
We will examine this further for some special cases later. 


B. Difference between Type 1 and Type 11 inference methods 

Type I and Type II provide two different approaches to 
solving the SSR problem. Hence it is important to understand 
the theoretical differences between the two inference proce¬ 
dures to identify their suitability for SSR. In (37| , the authors 
provide evidence for SBL, using a variational approximation 
to the prior p(x), that Type II methods attempt to approximate 
the true posterior p(x|y). If the true posterior distribution 
has a skewed peak, then the type 1 estimate (MAP of x) is 
not a good representative of the whole posterior. By trying 
to approximate the true posterior mass. Type II methods are 
likely to provide a better estimate. Similar discussion of Type 
II desirability is provided in 1281 in the context of general 
Bayesian inferencing. We revisit the issue and attempt to 
corroborate this by exploiting specific attributes of the SSR 
problem. We first manipulate the Type II objective as shown 
below. 


p(7|y) = J Pi 7 > x |y)dx 

= J P(7|x,y)p(x|y)dx 
= J P(7l*)p(x|y)dx 


(31) 


Pi 7) J 


P( x |7) 

Pi? 


p(x|y)dj( 


Lets assume that 7 is the solution of Equation {26}. It will be 
sparse for specific choice of p( 7 ) as shown in f22| , (231 . 


Now, let S_ be the index of non zero entries and S be the 
index of zero entries. So, we can say 7 ^ = 0. 


P(7|y) = limp(7 + e|y) 

e—>-0 


= p( 7 ) lim / /_ 

£_>0 Js_ Js 


p(xs|ts + es)p(> 

p(xs)p(xg) 


^^p(x|y)dx 


(32) 


p(xgjeg) is a normal distribution with mean zero and variance 
Eg. Hence when —¥ 0, p(xgjeij) becomes a dirac delta 
function, i.e. 5(ajg). 

Using the properties of dirac delta functions inside the 
integration, we obtain 

- L iSf R$^f <X2 ’ Xy “ 0|y)<iX2 

(33) 


Hence from this analysis, we see that we are evaluating 
a weighted integral of the true posterior p(x|y) over the 
subspaces spanned by the non zero indexes. This shows that 
in the evidence maximization framework instead of looking 
for the mode of the true posterior p(x|y), we approximate the 
true posterior by p(xjy; 7 ) where 7 is obtained by maximizing 
the true posterior mass over the subspaces spanned by the non 
zero indexes. This is in contrast to Type 1 methods that seek 
the mode of the true posterior and use that as the point estimate 
of the desired coefficients. 

Another favorable aspect of the Type II framework is that 
it inherits the robustness property of a Hierarchical Bayesian 
modeling framework. It has been shown extensively in the 
statistics literature (38), |39j, 1401 . that the posterior of a 
hyperparameter, i.e, 7 , is less affected by the wrong choices 
of prior than the posterior of the parameter x. In other words, 
parameters that are deeper in the hierarchy have less effect on 
the inference procedure, which allows us to be less concerned 
about the choice of p( 7 ). Another virtue is that the hierarchical 
framework allows for parameter tying and this can greatly 
reduce the search space for Type II methods by leading to 
an optimization problem with fewer parameters. This is more 
evident for problems like the MMV and block sparsity problem 

ED. E21, ED- 


C. Special case of Unified Type II with different choices of p 

As discussed above for the unified Type II approach our con¬ 
cerned posterior is p(x|y; 7 , a 2 ). For a point estimate of x we 
will use the mean of the posterior, x = J xp(x|y; 7 , <r 2 )dx. 
Now the posterior could be computed as. 


p(x|y; 7 ,< 7 2 


:p(y|x)p(x| 7 ) 
1 


exp{_ 2^ l|y_t&x|l2 ”^^r } 


(34) 


7 i 


The challenge is proper normalization and tractability of the 
computation of the mean. For the EM algorithm to be suc¬ 
cessfully implemented, one must also be able to carry out the 
E-step, Equation {28}. We now explore this for some specific 
PESM family members. 
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1) Choice of p = 2: Choice of p = 2 corresponds to 
Gaussian Scale Mixture, and is very tractable. The GSM based 
Type II methods have been extensively studied |T5}, 1211 . 1371 
and so we keep the discussion brief. This choice (in Equation 


{34}) leads to a Gaussian posterior given by 


p(x|y; 7 , cr 2 ) = N(p, E) 

(35) 

where 


p = r$ T {a 2 1 + $T$ T ) _1 y 

(36) 

e = r - r$ T (o- 2 i + <f>r$ T ) _ 1 <i?r 

(37) 


and T = diag( 7 ). The EM algorithm can also be readily carried 
out because the E-step requires the second moment which can 
be readily obtained using Equation {37}. The estimate of 7 in 
the M step and the updates of 7 depend on the mixing density 
p( 7 ) as shown in Equation \21\ and can be readily carried out 
for the non-informative prior and for a reasonable large class 
of priors (29). The true posterior can be approximated by a 
Gaussian distribution whose mean and covariance depend on 
the estimated hyperparameters. Now, for a point estimate of 
the coefficient vector, we will choose, 

x = p. (38) 

From Equation (36) , one can see that p is sparse if 7 is sparse. 
To complete the discussion, we discuss the most popular of the 
Type II methods. In Relevance Vector Machine (Type II) CD, 
Tipping has shown that the ’true’ coefficient prior used in SBL 
actually follows a student t distribution (GSM with Gamma 
distribution as mixing density), and discusses in detail how 
the hierarchical formulation of this prior helps to realize the 
supergaussian nature. Hence we can see that the corresponding 
Type II formulation of Reweighted £2 is SBL with a slight 
difference. In SBL e is set to zero which gives us an improper 
prior p(x ) ~ l/|as| which is sharply peaked at zero. But as 
discussed in previous literatures, e = 0 in Type I version, i.e, 
in Reweighted 1 2 increases the number of local minima and 
convergence to a sub optimal solution becomes more likely. 
Now to solve the M step for this case we will use the following 
PESM (p = 2) formulation. 

Lemma 4.1: Let x ~ N( 0,7), 7 ~ Inverse — 

Gamma(e, e) Then the resulting marginal density for x is 
GT(V 2, 2, e) ~ Student - t{ 2e). 

Details of this inference procedure can also be found in CD, 
CD, and update rules have been shown in Table II. 

2) Choice of p = 1: With pm 1, PESM reduces to a Lapla- 
cian Scale Mixture. To successfully carry out the EM algo¬ 
rithm, the E-step requires the computation of E( |£i|;y, 7 ^). 
A closed form expression does not appear feasible and a 
more numerical approach may be required. Also, the concerned 
posterior (Equation [34} does not appear to have a simple closed 
form expression making final inferencing a challenge along 
with the computation of the mean for the point estimate. An 
efficient numerical approach needs to be developed and is left 
for future work. 

In this work, we follow an alternate strategy and take 
advantage of the fact that the LSM family is contained within 
the GSM family. Since a Laplacian distribution can be written 
as a member of the GSM family (Equation [ 6 } [16], |25 |, it will 



X 

Fig. 2: Tail behavior of Student’s t distribution for different 
values of degrees of freedom 

be possible to get a closed form posterior using a three layer 
hierarchy. We will illustrate this for the prior associated with 
Type I Reweighted £\-minimization approach and develop a 
Type II variant. The closed form posterior will be Gaussian 
and have the same form for the case of p = 2 as shown in 
Equation (35) . The only difference between p = 2 and p — 1 
lies in the estimation of the hyperparameters. 

Type II £\ variant can also be derived and has been dealt 
with in previous work EH and for sake of completeness the 
update rule is summarized in Table II along with other Type 
II algorithms. We will now derive the M step for the case of 
Type II Reweighted £\-minimization which can be followed 
in a straightforward manner for other cases including the l\ 
variant. 

We have shown in the discussion of Type I Reweighted £\ 
that the concerned prior GT(1, l,e) in a Bayesian setting is 
a Laplacian Scale mixture. This prior can be represented in 
a 3 layer hierarchy involving a GSM representation for the 
Laplacian density as summarized below. 

Lemma 4.2: Let x ~ N{ 0,7), 7 ~ Exp(^) and A ~ 
Ga(e, t) where e > 0. Then the resulting marginal density for 
x is GT(l,l,e). 

Fig. E compares two corresponding densities, GT(1, 1,1) 
and Laplace distribution with A = 1. It is evident from this 
figure that the Laplace prior has relatively light tails which 
contributes to the problem of over-shrinking of the large coef¬ 
ficients. On the other hand, the generalized t distribution has 
relatively heavier tails and a peak at zero which promotes zero 
coefficients. This is another reason of the superior recovery 
performance of Reweighted £\-minimization over the LASSO, 
i.e. l\ -minimization, approach. 

Now, for estimation of hyperparameters 7 and A in the three 
layer hierarchy, an EM algorithm will be developed. As in 
Section |IV-A| using (y, x) as the complete data, maximizing 
the conditional expectation of the complete data log likelihood 
involves maximizing, 

Q( 7 , A, a 2 ) = £ x | y ; T ,A,a 2 [logp(y, x; 7 , A, o 2 )] (39) 

In the E step, for iteration L, we only need to compute the 
second moment which is straightforward because of the GSM 










Type II algorithm 


Mixing Density 


Update Rules 


Type II l\ p(ji\\) = Exp(X/2) 

Type II ReTi (Candes) p( 7 i|A) = Exp( A 2 /2), p( A) = Gamma(e,e) 

Type II ReT 2 (Chartrand) p{ 7 i|e) = Inv — Gamma{e,e) 


7 i = 


- — l + x/l + 'tAiMf +^i,i) \ 2A 1 

- 2A ’ A - T~Ti 

-l+^/l+^i/if+Si.i) t _ -s+ v / f 2 +4(2M+£-l)El, 
2 A^ ’ A_ 2 E 7i 

/A+S; i+2e 

7» ~ 2e+l 


TABLE II: Updating Hyperparameters of Type II Algorithms 



X 


7 and A we obtain, 


Q( 7, A) = -j£ l0 ^ - 


S (M) + 
7i 


+ E( 2 log A “ + ” 1 ' ) log A “ eA 


(42) 


In the M step, taking the derivative of the Q function w.r.t 
7 i and A and setting to zero results in. 


dQ _ 1 £(*>*) + p 2 A 2 

cPp 27; 27? 2 


(43) 


Fig. 3: Comparison of tail behavior of two distributions: 
Generalized Double Pareto (GDP) and Laplacian 


Solving this quadratic equation we obtain, 

„ _ — 1 + \/l + 4A 2 (p? + Sj,j) 
7i = 2A 2 


(44) 



Fig. 4: Bar plot of the recovery performance for Type I and 
Type II Reweighted i\ (Candes et al) minimization 


representation of the Laplacian, i.e. 

^x|y;7* ,A,cr 2 I'D ] “t~ Pi 


(40) 


Similarly, 


5Q 

9A 


2M + e - 1 
A 


^ E 7i _ e = 0 

i 


Hence, 


A = 


-e+Ve 2 +4(2M + e-l)E I 7t 

2Ei7t 


(45) 


(46) 


We can also estimate the measurement noise variance a 2 by 
maximizing the above Q function as shown in ED. In this 
work, for simplicity, we will assume that the SNR of the 
environment is known to us before hand. We can also employ a 
fixed point optimization technique as shown in ED to estimate 
the hyperparameters. 

After convergence, one finds that most of the 7 ,;. i.e. the 
variance of the normal distribution are driven to zero, which 
makes the associated coefficient zero and prunes it out from 
the model. 


In the M step, the Q function is maximized with respect to 
the hyperparameters, 7 and A. 

Q( 7 , A) = £ x | y; ^ A , CT 2 [logp(yjx) + logp(x| 7 ) 

+ logp( 7 |A) + logp(A|e)] 

Now using the E step and only retaining the terms that involve 


V. Numerical Experiments 

In this section we present a set of experiments to evaluate 
and compare the Type II/Hierarchical framework based meth¬ 
ods with those based on regularization framework, i.e. Type 
I methods (MAP estimation), for the task of sparse signal 
recovery. The experimental setup used is quite standard and 
has been used widely in the SSR literatures. 
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Fig. 6: Recovery performance with Gaussian distributed non 
zero coefficients 


Fig. 5: Reconstruction of uniform spikes where k = 13 using 
(a) Original Signal, (b) £\ norm minimization (Type I), (c) 
Type II t\ minimization, (d) Candes et al (Type I) Reweighted 
1 1 minimization 

A. Problem Specification 

The measurement vector y is generated using a N x M = 
50 x 250 dictionary <f>, whose elements are generated from a 
i.i.d normal distribution with mean=0 and variance=l. A sparse 
signal Xjen of length 250 is generated such that ||x gerl ||o = k. 
The support, i.e. the location of the k nonzero elements, is 
chosen randomly, and the values are chosen from three different 
distributions: 

(I) Uniform ±1 random spikes. (Sub-Gaussian) 

(II) Zero mean unit variance Gaussian. 

(Ill) Student t distribution with degrees of freedom v = 3. 

(Super-Gaussian) 

The synthetic measurements are generated using y = 
<f>Xgen. The generated measurements and the dictionary are 
then provided as input to the algorithms. The estimated co¬ 
efficients are compared with the original x flen that has been 
used to generate the measurement. For a single instance, the 
method is credited with a successful recovery if the estimate 
x satisfies, 

||x gen - x||oo < 10 -3 (47) 

500 trials are conducted for various fixed combinations of k, 
i.e. the number of non zero coefficients, and the probability of 
successful recovery is plotted with respect to k. As expected, 
the probability of successful recovery decreases as k, i.e. the 
cardinality of support, increases. 

B. Recovery Performance 

1) Competing Algorithms: Since the main goal of our work 
is to compare the Type I algorithms with their Type I counter¬ 
parts, we designed the Type II versions of three well known 
norm minimization based Type I algorithms and compare their 
performance. The algorithms in the study are: 



Fig. 7: Recovery performance with Super Gaussian (Student t) 
distributed non zero coefficients 


• l\ minimization based SSR. (Basis Pursuit) 

• Type II l\ minimization based SSR. (Fixed A = 5) 

• Type I Reweighted l\ minimization, (e = 0.1 (3) 

• Type II Reweighted l\ minimization (Fixed t = 100) 

• Type I Reweighted £2 minimization, (e regularized, opti¬ 
mal update from (6)) 

• Type II Reweighted I 2 minimization (Fixed t = 0: SBL) 

2) Performance Comparison: In Figure [6] the probability 
of successful recovery with increasing support cardinality is 
plotted for the case where the non zero coefficients are from a 
zero mean, unit variance, Gaussian distribution. It is evident 
from this plot that for all the algorithms. Type II versions 
outperform their Type I counterparts. This performance dif¬ 
ference is significant in case of £\ norm minimization. Type 
I Reweighted £2 minimization approach works much better 
compared to other two Type I methods, and the reason being 
the heuristic update of e, which helps it to get away from 
local minima. Hence, t update in Reweighted £2 (Type I) 
is absolutely necessary as we have found out for fixed e 
this algorithm’s performance decreases significantly. Figure [4] 
shows this comparison for the Reweighted £\ minimization 
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Fig. 8: Recovery performance with Sub Gaussian distributed 
non zero coefficients 

(Candes et al) in detail. The figure indicates trials when both 
Type I and Type II method have been successful and when 
only one of them has been successful and it is evident that for 
high values of k, Type II outperforms Type I by a significant 
margin. 

In Figure [7] the probability of successful recovery with 
increasing support cardinality is plotted where the non zero 
coefficients are generated from a student’s t distribution with 
degrees of freedom 3. Again, the empirical superiority of the 
Type II versions over their Type I counterparts is evident from 
Figure [7] Interesting point to note here, is the performance 
improvement of Type I and Type II version of Reweighted £2 
algorithm over the others is significant and the reason could be 
that assumed prior for the non zero coefficients and the true 
prior have the same tail behavior (student’s t) and are better 
matched. 

Finally, we repeat the same set of experiments where the 
non zero coefficients follow a sub-gaussian distribution, i.e. 
Uniform ±1 random spikes, and the plot of the probability 
of successful recovery with increasing support cardinality is 
shown in Figure [8] Though Type II methods still outperform 
their Type 1 counterparts, the performance improvement is less 
significant compared to the previous two cases. The reason 
could be that, since the assumed priors are supergaussian, i.e. 
heavy tails, it is difficult to model the true prior, i.e. sub 
gaussian density, for the nonzero coefficients. In Figure [5] 
an instance of reconstruction is shown using k = 13 along 
with the original signal. It is evident that both l\ minimization 
(Type I) and Candes’s Reweighted l\ minimization (Type I) 
fail, whereas Type II version of l\ minimization recovers 
the original signal. For this instance, the other three SSR 
algorithms have also been successful in recovering the original 
signal. 

VI. Conclusion and Discussion 

In this paper, we formulated the SSR problem from a 
Bayesian perspective and presented two different Bayesian 
frameworks which encompass all the well known recovery 
algorithms in practice. We presented a generalized scale mix¬ 
ture family : PESM, which is of prime importance for the 


design of Hierarchical Bayesian Recovery algorithms, i.e. Type 
II algorithms. The unified treatment of both £\ and £2 norm 
minimization based algorithms along with the design of Type 
II version of the Reweighted £\ minimization algorithm are the 
main contributions of this work. 

We also showed that, in a hierarchical Bayes framework 
instead of looking for a mode of the true posterior Type II 
methods actually try to find an approximate posterior such that 
the mass of the true posterior over the subspace spanned by non 
zero indexes is maximized. This leads to a better approximation 
of the true posterior, which is the reason behind the superior 
empirical results obtained using the Type II framework. Type 
II framework also enjoys the robustness property inherited 
because of its connection with Hierarchical Bayes which allows 
one to be less concerned about the choice of prior on the 
hyperparameters. 
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